Image Captioning


Image captioning is the task of generating a textual description of an image. It combines Computer Vision (CV), which interprets the visual content, with Natural Language Processing (NLP), which turns that interpretation into a fluent caption.
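As a rough illustration of that two-stage split, here is a minimal toy sketch in plain Python. It is not how any of the listed papers work: real systems typically use a learned image encoder and a learned language decoder, whereas here both stages are hand-written stand-ins. All names (`encode_image`, `decode_caption`, `caption`) and the brightness thresholds are assumptions made up for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0-255


def encode_image(pixels: List[Pixel]) -> Tuple[float, float, float]:
    """Vision stage (toy): summarize the image as its mean RGB color."""
    n = len(pixels)
    if n == 0:
        raise ValueError("image has no pixels")
    return (
        sum(p[0] for p in pixels) / n,
        sum(p[1] for p in pixels) / n,
        sum(p[2] for p in pixels) / n,
    )


def decode_caption(features: Tuple[float, float, float]) -> str:
    """Language stage (toy): fill a fixed sentence template from the features."""
    r, g, b = features
    brightness = (r + g + b) / 3
    # Hypothetical thresholds, chosen only for illustration.
    if brightness < 85:
        tone = "dark"
    elif brightness > 170:
        tone = "bright"
    else:
        tone = "mid-toned"
    # Pick the channel with the largest mean value.
    dominant = max((("red", r), ("green", g), ("blue", b)), key=lambda kv: kv[1])[0]
    return f"A {tone} image that is mostly {dominant}."


def caption(pixels: List[Pixel]) -> str:
    """Full pipeline: image -> features -> text."""
    return decode_caption(encode_image(pixels))
```

Calling `caption([(10, 20, 200), (0, 30, 220)])` runs both stages in sequence. In a practical system the mean-color features would be replaced by embeddings from a trained vision model, and the template by a trained text generator.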

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos

Jul 16, 2025
Via arXiv

CATVis: Context-Aware Thought Visualization

Jul 15, 2025
Via arXiv

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Jul 10, 2025
Via arXiv

GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning

Jul 09, 2025
Via arXiv

CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

Jul 08, 2025
Via arXiv

Interpretable EEG-to-Image Generation with Semantic Prompts

Jul 09, 2025
Via arXiv

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Jul 09, 2025
Via arXiv

CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why

Jul 10, 2025
Via arXiv

CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

Jul 02, 2025
Via arXiv

How Do Vision-Language Models Process Conflicting Information Across Modalities?

Jul 02, 2025
Via arXiv